Hardware-Efficient Attention for Fast Decoding
Optimizing Latency and Throughput in LLM Decoding with GTA & GLA
Authors: T. Zadouri et al.
Published on Arxiv: 2025-05-27
Link: http://arxiv.org/abs/2505.21487v1
Institutions: Department of Computer Science, Princeton University • Princeton Language and Intelligence, Princeton University
Keywords: hardware-efficient attention, KV cache, arithmetic intensity, Grouped-Tied Attention (GTA), Grouped Latent Attention (GLA), Multi-head Latent Attention (MLA), Grouped-Query Attention (GQA), tensor parallelism, paged KV, FlashAttention3, speculative decoding, latency, throughput, LLM inference, FineWeb-Edu-100B, NVIDIA H100, memory-bound workload
Large language model (LLM) decoding faces performance bottlenecks due to high memory bandwidth requirements for KV cache retrieval, especially as context length and batch size grow. Existing attention mechanisms, like Multi-Head Attention (MHA), are increasingly suboptimal as compute capabilities outpace memory bandwidth improvements, leaving modern inference workloads memory-bound and limiting acceleration potential on new hardware.
To address these hardware challenges, the authors propose and evaluate novel attention mechanisms designed to maximize arithmetic intensity and parallelizability without quality loss:
- Introduce Grouped-Tied Attention (GTA), which consolidates key and value states per group, reducing KV cache needs and enhancing arithmetic intensity.
- Develop Grouped Latent Attention (GLA), enabling parallel-friendly latent attention through latent head grouping and GPU-level optimizations, including memory and kernel improvements suitable for NVIDIA H100 GPUs.
- Conduct comprehensive experiments across multiple model sizes on FineWeb-Edu-100B, benchmarking against established attention variants (MHA, MQA, GQA, MLA) and implementing system-level enhancements like software pipelining and distributed offset calculation.
Following these innovations, the authors perform quantitative and qualitative benchmarks to demonstrate the advantages:
- Show that GTA matches Grouped-Query Attention (GQA) in model quality, while halving KV cache requirements and doubling compute efficiency for the same group settings.
- Demonstrate that GLA equals Multi-head Latent Attention (MLA) in quality but achieves up to 2× faster speculative decoding and maintains or improves downstream accuracy, e.g., GLA-2 scoring 60.0% vs. MLA’s 59.1% on XL models.
- Highlight system-level results: GLA reduces live serving latency and doubles throughput, while optimized kernels operate near hardware limits (up to 93% of GPU memory bandwidth, 70% of peak TFLOPS).
Synthesizing these results, the study draws its main conclusions:
- Emphasizing arithmetic intensity and parallelism yields substantial hardware efficiency gains in LLM decoding.
- GTA offers a compelling GQA replacement, saving on KV cache memory, while GLA provides a scalable and markedly faster MLA alternative for current hardware environments.
- The provided open-source kernels and approaches facilitate more practical, scalable, and high-quality LLM inference, with open directions for scaling further and adapting to other architectures.